This project focuses on replicating aspects of the paper “Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq” by Tirosh et al. The biological question of interest is the exploration of the distinct genotypic and phenotypic states of melanoma tumors. There is a need for a deeper understanding of melanoma composition and its effect on the clinical course as the cellular composition of each tumor may exert critical roles in cancer development. Furthermore, classification of tumor heterogeneity may inform treatment through immunological or other mechanisms. This is particularly important in tumors with few treatment options, or in cases where tumors are likely to recur.
The paper “used a multistep approach to distinguish the different cell types within melanoma tumors on the basis of both genetic and transcriptional states,” which we attempted to replicate. For their first step and one of this project’s tasks, large-scale copy number variations (CNVs) were inferred from expression profiles by averaging expression over stretches of 100 genes on their respective chromosomes. For the next task, the cells were grouped according to their expression profiles using t-SNE (nonlinear dimensionality reduction). In addition to t-SNE, we attempted dimensionality reduction using UMAP and clustering with SIMLR. The paper found that in general, “cells designated as malignant by CNV analysis formed a separate cluster for each tumor, suggesting a high degree of intertumor heterogeneity.” The non-malignant cells, however, clustered by cell type and not by tumor or metastatic site.
We utilized the publicly available dataset GSE72056 containing 4645 single-cell RNA-seq profiles measured from malignant, immune, and stromal cells, isolated from 19 patients that span a range of clinical and therapeutic backgrounds. For the malignant cells, only tumors with >50 malignant cells were included. The paper included 6 tumors on this basis, but we found 8 tumors satisfying the condition, depending on ordering of filtering criteria. For the non-malignant cells, only tumors with >100 non-malignant cells were included. Using this criteria, the paper included 11 tumors in the main figure while we included 12.
Before beginning the analysis, we used the Seurat package to perform quality checks on the data. The ‘SCTransform’ function scaled the residuals to have unit variance and centered the residuals to have mean zero. However, the data was already pre-processed and using centered and scaled data resulted in worse performance compared to using just centered data. ‘SCTransform’ was also used to find the top 300 and 3000 variable genes, used to subset the data is later analysis. As will be seen later, using data containing only the 300 most variable genes improved analysis compared to using only the 3000 most variable genes and also compared to using the complete data.
For the first task, we attempted to replicate the following plot using InferCNV, which is aimed at inferring the copy number variarions from tumor single cell RNA-seq data. In addition, we use six tumors as they choose in tSNE plot in this paper, and put their InferCNV results together to see if there are any common alterations.
Copy number variations (CNVs) are genomic alterations, with abnormal deletions or amplifications of one or more gens on segments of chromosomes. And InferCNV is applied to explore the tumor single cell RNA-seq data, detect chromosomal copy number alterations and identify deletions or amplifications of malignant cell gene expression by comparing it to a set of non-malignant cells.
A heatmap is generated directly from the InferCNV for each run, illustrating the relative expression intensities across each chromosome. From the plot, it’s apparent to see which regions of the tumor genome are over-abundant or less-abundant as compared to that of non-malignant cells.
Advanced application of InferCNV also includes methods to predict CNA regions and define cell subclusters according to patterns of heterogeneity, and you can run it step by step for more exploratory purposes, which is not included in this report. Here we just use the standard pipeline to replicate the figures in paper and compare the results.
Plot caption: Selected tumor cells are shown below with individual cells (y axis) and chromosomal regions (x axis). Amplifications (red) or deletions (blue) are inferred by averaging expression over 100-gene stretches on the respective chromosomes.
Use non-malignant cells as reference cells for Mel80:
Mel80 InferCNV plot shows strong amplifications at chromosome 1, 2, 3, 7, 8, 15, 19, 20, 21 and 22. It shows large deletions at chromosome 4, 6, 9, 10, 14 and 19. Most of the figures recapitulate the results as shown in the paper and are also consistent with the whole exon sequencing data. There is some discrepancy between our plot and the plot in the paper regarding copy number variations on chromosome 8 and 22. Our data shows some amplification on chromosome 22. According to literature, there are some gene loci including the region where EP300 gene is located on chromosome 22 often shows amplification in RAS mutant tumors. Mel 80 has NRAS Q61L mutation. This observation suggests our result could be real. There might be some error during WES or in the analysis of the paper. To address the concern regarding the differential results, it would be very informative to compare our result with the TCGA melanoma CNV data[1].
We also try to plot heatmap for Mel78 (also non-malignant cells as reference):
As shown above, we get almost identical result as shown in the supplementary materials of the paper and it is also consistent with the whole exon sequencing data. It shows strong amplification at chromosome 1, 2, 3, 10, 11, 14, 21 and 22. It shows deletions at chromosome 3, 4, 5, 6, 11, 12, 13, 17, 18, 19 and X.
Comparing the InferCNV plot of Mel 78 and 80, we find that they share some common copy number variations. For example, they both have amplification on chromosome 1, 2, 3, 7, 19, 21 and 22, and deletions on chromosome 4, 5, 6, 19 and X. This might because both of them are RAS mutant tumors.
Overall, we analyzed 6 tumors by InferCNV. As shown below, 5 out of the 6 tumors belong to 3 genomic subtypes of melanoma. Mel80, 78 and 88 all belong to RAS mutant subtype. Mel 81 has BRAF mutation. Mel 79 is wildtype. The subtype of Mel 89 is not identified. Comparing the results of 6 tumors, we can see obvious intertumoral heterogeneity, that each tumor has its own copy number variations. However, we can also see similarities shared among different tumors, especially the ones that have the same subtype. For example, Mel78, Mel 80 and Mel 88 all have amplification of chromosome 22, further supporting our earlier analysis done for Mel 80 is true.
By clustering all the 6 tumors together, we can identify some amplifications and/or deletions common to all of the tumors, such as amplifications at chromosome 1-3, deep deletions at chromosome 6 and 10, suggesting these regions might play a very important role in the progression of melanoma. MITF, a master regulator of melanocyte and also an oncogene is located at chromosome 3. The well-known tumor suppressor gene PTEN is located at chromosome 10. Further detailed study of these commonly and /or differentially altered genomic regions could identify potential oncogenes or tumor suppressor genes involved in the development of melanoma and give guidance to targeted therapies or immune therapies.
For the second task, we attempted to replicate the following plots using various visualization and clustering methods. Malignant and non-malignant cell-types were distinguished using single-cell expression profiles. Clusters of non-malignant cells are marking by dashed ellipses and were annotated as T cells, B cells, macrophages, endothelial cells, cancer-associated fibroblasts, and natural killer cells.
T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique used for visualizing high-dimensional data such as a single-cell gene expression data into low dimensional space. t-SNE allows us to visualize the relationships between cells based on their ensemble gene expression profiles.
We applied t-SNE to the single-cell RNA-seq data from Tirosh et al., separating malignant from non-malignant cells as determined from inferCNV. First, we loaded the data into R, separating the phenotypic and feature data from the single-cell expression matrix:
## Read in data, separate into phenotypic and feature data tables ####
dat <- fread("data/GSE72056_melanoma_single_cell_revised_v2.txt")
edat <- as.matrix(dat[4:nrow(dat), 2:ncol(dat)])
fdat <- data.table(genes=unlist(dat[4:nrow(dat),1]))
fdat$genes[17008] <- paste0(fdat$genes[17008], "_1")
fdat$genes[23202] <- paste0(fdat$genes[23202], "_2")
fdat$genes[8027] <- paste0(fdat$genes[8027], "_1")
fdat$genes[23637] <- paste0(fdat$genes[23637], "_2")
pdat <- data.table(
cellID=colnames(dat)[2:length(colnames(dat))],
tumor=unlist(dat[1,2:ncol(dat)]),
mal=unlist(dat[2,2:ncol(dat)]), #malignant(1=no, 2=yes, 0=unresolved)
cell_type=unlist(dat[3,2:ncol(dat)]) #non-malignant cell type (1=T, 2=B, 3=Macro., 4=Endo., 5=CAF; 6=NK)
)
rownames(edat) <- fdat$genes
To reduce the complexity of the t-SNE visualization, we restricted the number of tumors used in the analysis according to Tirosh et al. For the non-malignant cell analysis, tumors with at least 100 cells were used:
## Tumors with at least 100 cells
filteredTumors1 <- pdat[,.N, by = tumor] %>%
filter(., N >= 100) %>%
select(., tumor) %>%
pull(., tumor)
Tumors were filtered a second time to include only those with more than 50 malignant cells for the malignant cell analysis:
## Tumors with >50 malignant cells
filteredTumors2 <- pdat[mal ==2 & tumor %in% filteredTumors1, .N, by = tumor] %>%
filter(., N > 50) %>%
select(., tumor) %>%
pull(., tumor)
These filters were used to subset the expression data matrix into nonMalignant and malignant matricies, respectively.
## Define malignant and non-Malignant Cells ####
nonMalignant <- edat[,pdat[tumor %in% filteredTumors1 & mal == 1, cellID]]
malignant <- edat[,pdat[tumor %in% filteredTumors2 & mal == 2, cellID]]
After subsetting the data, we ran the seurat pipeline which involves the following steps:
LogNormalize functionThese steps can be condensed into a single function:
## For Convenience these steps can be condensed into a single function ####
run_seurat <- function(exprsMatrix){
## Create Seurat Object ####
sobj <- CreateSeuratObject(raw.data = exprsMatrix)
## Normalization ####
sobj <- NormalizeData(sobj, normalization.method = 'LogNormalize', scale.factor = 10000)
## Finding Highly Variable Genes ###
sobj <- FindVariableGenes(sobj,
mean.function = ExpMean,
dispersion.function = LogVMR,
x.low.cutoff = 0.0125,
x.high.cutoff = 3,
y.cutoff = 0.5)
## Scaling data to regress out confounders (detected molecules per cell) ####
sobj <- ScaleData(sobj, vars.to.regress = c('nUMI'))
## Dimensionality Reduction ####
sobj <- RunPCA(sobj, pc.genes = sobj@var.genes, do.print = T, pcs.print = 1:5, genes.print = 5)
## Clustering and tSNE Plot ####
sobj <- FindClusters(sobj, reduction.type = "pca", dims.use = 1:8, resolution = 1.0, print.output = 0, save.SNN = T)
sobj <- RunTSNE(sobj, dims.use = 1:8)
TSNEPlot(sobj)
return(sobj)
}
Below are expression QC plots for malignant and non-malignant cells, respectively, that visualize and plot both gene and expression counts for all cells.
We ran the seurat pipeline on the malignant, non-malignant, and all cells combined using the function with the parameters shown above. The resulting t-SNE embeddings were extracted along with the phenotypic data (i.e. tumor and cell_type). The results were plotted and colored according to all different combinations of the phenotypic data:
## Run Seurat Pipeline ####
sobjM <- run_seurat(malignant)
sobjN <- run_seurat(nonMalignant)
sobjC <- run_seurat(edat)
## Extract tSNE Data for Malignant Cells ####
mtSNE <- data.table(
cellID = rownames(sobjM@dr$tsne@cell.embeddings),
tSNE_1 = sobjM@dr$tsne@cell.embeddings[,1],
tSNE_2 = sobjM@dr$tsne@cell.embeddings[,2],
tumor = factor(pdat$tumor[pdat$cellID %in% colnames(malignant)]),
cell_type = factor(pdat$cell_type[pdat$cellID %in% colnames(malignant)])
)
## Extract tSNE Data for non-Malignant Cells ####
ntSNE <- data.table(
cellID = rownames(sobjN@dr$tsne@cell.embeddings),
tSNE_1 = sobjN@dr$tsne@cell.embeddings[,1],
tSNE_2 = sobjN@dr$tsne@cell.embeddings[,2],
tumor = factor(pdat$tumor[pdat$cellID %in% colnames(nonMalignant)]),
cell_type = factor(pdat$cell_type[pdat$cellID %in% colnames(nonMalignant)])
)
## Extract tSNE Data for All Cells ####
ctSNE <- data.table(
cellID = rownames(sobjC@dr$tsne@cell.embeddings),
tSNE_1 = sobjC@dr$tsne@cell.embeddings[,1],
tSNE_2 = sobjC@dr$tsne@cell.embeddings[,2],
tumor = factor(pdat$tumor[pdat$cellID %in% colnames(edat)]),
cell_type = factor(pdat$cell_type[pdat$cellID %in% colnames(edat)])
)
To identify the cluster groups for the non-malignant cells we used Density-Based Spatial Clustering and Application with Noise (DBSCAN) on the t-SNE visualization as in Tirosh et al. The plot below shows the t-SNE before and after coloring by the clusters found with DBSCAN. In Tirosh et al. they used DBSCAN parameters eps = 6 and MinPts = 10, but we were unable to recover the same clusters using these parameters, but eps = 2 and MinPts = 10 recovered many of the clusters.
## Identifying Clusters with DBScan ####
kNNdistplot(sobjN@dr$tsne@cell.embeddings, k=10)
abline(h=2)
db <- fpc::dbscan(sobjN@dr$tsne@cell.embeddings, eps = 2, MinPts = 10)
In Tirosh et al., the significant marker genes were identified to determine cell identity of each cluster. We took the genes for each cell type defined in the supplement and created gene signatures to identify the same clusters in our t-SNE plots. The mean expression value of each signature for each cell was computed and used to identify each cluster as shown below:
## Compile Gene Signature Markers and Make Feature Plots ####
markers <- list(
tcells=c('CD2','CD3D','CD3E','CD3G','CD8A','SIRPG','TIGIT','GZMK','ITK','SH2D1A','CD247','PRF1','NKG7',
'IL2RB','SH2D2A','KLRK1','ZAP70','CD7','CST7','LAT','PYHIN1','SLA2','STAT4','CD6','CCL5','CD96',
'TC2N','FYN','LCK','TCF7','TOX','IL32','SPOCK2','SKAP1','CD28','CBLB','APOBEC3G','PRDM1'),
bcells=c("CD19", "CD79A", "CD79B", "BLK", "MS4A1", "BANK1", "IGLL3P",
"FCRL1", "PAX5", "CLEC17A", "CD22", "BCL11A", "VPREB3", "HLA-DOB",
"STAP1", "FAM129C", "TLR10", "RALGPS2", "AFF3", "POU2AF1", "CXCR5",
"PLCG2", "HVCN1", "CCR6", "P2RX5", "BLNK", "KIAA0226L", "POU2F2",
"IRF8", "FCRLA", "CD37"),
macro=c("CD163", "CD14", "CSF1R", "C1QC", "VSIG4", "C1QA", "FCER1G",
"F13A1", "TYROBP", "MSR1", "C1QB", "MS4A4A", "FPR1", "S100A9",
"IGSF6", "LILRB4", "FPR3", "SIGLEC1", "LILRA1", "LYZ", "HK3",
"SLC11A1", "CSF3R", "CD300E", "PILRA", "FCGR3A", "AIF1", "SIGLEC9",
"FCGR1C", "OLR1", "TLR2", "LILRB2", "C5AR1", "FCGR1A", "MS4A6A",
"C3AR1", "HCK", "IL4I1", "LST1", "LILRA5", "CSTA", "IFI30", "CD68",
"TBXAS1", "FCGR1B", "LILRA6", "CXCL16", "NCF2", "RAB20", "MS4A7",
"NLRP3", "LRRC25", "ADAP2", "SPP1", "CCR1", "TNFSF13", "RASSF4",
"SERPINA1", "MAFB", "IL18", "FGL2", "SIRPB1", "CLEC4A", "MNDA",
"FCGR2A", "CLEC7A", "SLAMF8", "SLC7A7", "ITGAX", "BCL2A1", "PLAUR",
"SLCO2B1", "PLBD1", "APOC1", "RNF144B", "SLC31A2", "PTAFR", "NINJ1",
"ITGAM", "CPVL", "PLIN2", "C1orf162", "FTL", "LIPA", "CD86",
"GLUL", "FGR", "GK", "TYMP", "GPX1", "NPL", "ACSL1"),
endo=c("PECAM1", "VWF", "CDH5", "CLDN5", "PLVAP", "ECSCR", "SLCO2A1",
"CCL14", "MMRN1", "MYCT1", "KDR", "TM4SF18", "TIE1", "ERG", "FABP4",
"SDPR", "HYAL2", "FLT4", "EGFL7", "ESAM", "CXorf36", "TEK", "TSPAN18",
"EMCN", "MMRN2", "ELTD1", "PDE2A", "NOS3", "ROBO4", "APOLD1",
"PTPRB", "RHOJ", "RAMP2", "GPR116", "F2RL3", "JUP", "CCBP2",
"GPR146", "RGS16", "TSPAN7", "RAMP3", "PLA2G4C", "TGM2", "LDB2",
"PRCP", "ID1", "SMAD1", "AFAP1L1", "ELK3", "ANGPT2", "LYVE1",
"ARHGAP29", "IL3RA", "ADCY4", "TFPI", "TNFAIP1", "SYT15", "DYSF",
"PODXL", "SEMA3A", "DOCK9", "F8", "NPDC1", "TSPAN15", "CD34",
"THBD", "ITGB4", "RASA4", "COL4A1", "ECE1", "GFOD2", "EFNA1",
"PVRL2", "GNG11", "HERC2P2", "MALL", "HERC2P9", "PPM1F", "PKP4",
"LIMS3", "CD9", "RAI14", "ZNF521", "RGL2", "HSPG2", "TGFBR2",
"RBP1", "FXYD6", "MATN2", "S1PR1", "PIEZO1", "PDGFA", "ADAM15",
"HAPLN3", "APP"),
caf=c("FAP", "THY1", "DCN", "COL1A1", "COL1A2", "COL6A1", "COL6A2",
"COL6A3", "CXCL14", "LUM", "COL3A1", "DPT", "ISLR", "PODN", "CD248",
"FGF7", "MXRA8", "PDGFRL", "COL14A1", "MFAP5", "MEG3", "SULF1",
"AOX1", "SVEP1", "LPAR1", "PDGFRB", "TAGLN", "IGFBP6", "FBLN1",
"CA12", "SPOCK1", "TPM2", "THBS2", "FBLN5", "TMEM119", "ADAM33",
"PRRX1", "PCOLCE", "IGF2", "GFPT2", "PDGFRA", "CRISPLD2", "CPE",
"F3", "MFAP4", "C1S", "PTGIS", "LOX", "CYP1B1", "CLDN11", "SERPINF1",
"OLFML3", "COL5A2", "ACTA2", "MSC", "VASN", "ABI3BP", "C1R",
"ANTXR1", "MGST1", "C3", "PALLD", "FBN1", "CPXM1", "CYBRD1",
"IGFBP5", "PRELP", "PAPSS2", "MMP2", "CKAP4", "CCDC80", "ADAMTS2",
"TPM1", "PCSK5", "ELN", "CXCL12", "OLFML2B", "PLAC9", "RCN3",
"LTBP2", "NID2", "SCARA3", "AMOTL2", "TPST1", "MIR100HG", "CTGF",
"RARRES2", "FHL2"),
nk=c("KLRB1", "KLRC1", "KLRD1", "KLRF1"),
cd8T=c("CD8A", "CD8B")
)
temp <- nonMalignant
tcell_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$tcells),x] %>% mean)
bcell_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$bcells),x] %>% mean)
macro_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$macro),x] %>% mean)
endo_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$endo),x] %>% mean)
caf_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$caf),x] %>% mean)
nk_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$nk),x] %>% mean)
cd8T_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$cd8T),x] %>% mean)
temp <- rbind(temp, tcell_sig, bcell_sig, macro_sig, endo_sig, caf_sig, nk_sig, cd8T_sig)
rownames(temp)[nrow(temp)]
## Run tSNE Pipeline ####
temp_obj <- run_seurat(temp)
## Feature Plots of 'Cell Markers' ####
p <- FeaturePlot(temp_obj, c("tcell_sig", "bcell_sig", "macro_sig", "endo_sig", "caf_sig", "nk_sig", "cd8T_sig"),
cols.use = c("lightgrey", "blue"), nCol = 4, no.legend = F, do.return = T)
Uniform Manifold Approximation and Project (UMAP) is a novel manifold learning technique for dimension reduction competitive with t-SNE for data visualization. Before running UMAP, linear dimensional reduction (principal components analysis) was performed. Then UMAP was run on both the untransformed complete data, and on the centered data containing only the 300 and 3000 most highly variable genes. The untransformed complete data plot demonstrates the importance of QC. The trasnformed and filtered data was performed using the Seurat pipeline. In the UMAP plots below, it can be seen that the malignant cells cluster by tumor origin. In all three cases, distinct clustering can be seen, although in varying degrees. Arguably, t-SNE performs the best in terms of creating distinct separation between clusters. In the t-SNE plot, the non-malignant cells appeared to cluster by cell-type. This is somewhat the case for UMAP run on the centered data using the top 300 most variable genes. However, the T cells form two separate clusters, while some of the other cell-types are lost within the T cells. The separatation of T cells into two clusters is supported by the feature plots shown previously. For UMAP run on the complete data and on the top 3000 most variable genes, one does not observe clustering by cell-type. Clustering by tumor is not apparent either.
Single-cell Interpretaion via Multi-kernal LeaRning (SIMLR) is a similarity-learning frameowrk that learns an appropriate distance metric from the data for purposes of dimension reduction, clustering, and visualization. SIMLR and SIMLR large scale was run on the transformed data using only the 300 most variable genes. SIMLR took approximately 16 minutes to run on the malignant cells, where as SIMLR large scale only took less than 5 seconds to run. The normalized mutual information (NMI) was computed between the clusters inferred by SIMLR large scale and the ground truth clusters. NMI scores range from 0 and 1, with higher values reflecting better performance. Again, in the t-SNE plots, malignant cells cluster by tumor origin. The same is true for clusters identified by SIMLR. The clusters found by SIMLR (NMI=0.6077) are well-separated but a few tumors were clustered together, as can be seen below. SIMLR large scale performed much better than SIMLR (NMI=0.8518) in clustering the malignant cells. SIMLR had the worst performance in clustering the non-malignant cells (NMI=0.5708). As in UMAP, SIMLR created two clusters for T cells while other cell-types were lost within the T cells. There is some clustering by cell-type, but it is not as apparent as in the t-SNE plots, and clustering by tumor is not apparent either.
[1] Genomic Classification of Cutaneous Melanoma. Cell. 2015 Jun 18;161(7):1681-96.